Abstract
The status of archaeology as a science has been debated for decades and influences how we practice and teach archaeology. This study presents a novel bibliometric assessment of archaeology’s status relative to other fields using a hard/soft framework. It also presents a systematic review of computational reproducibility in published archaeological research. Reproducibility is a factor in the hardness/softness of a field because of its importance in establishing consensus. Analyzing nearly 10,000 articles, I identify trends in authorship, citation practices, and related metrics that position archaeology between the natural and social sciences. A survey of reproducibility reviews for the Journal of Archaeological Science reveals persistent challenges, including missing data, unspecified dependencies, and inadequate documentation. To address these issues, I recommend basic practical steps for authors, such as standardized project organization and explicit dependency documentation. Strengthening reproducibility will enhance archaeology’s scientific rigor and ensure the verifiability of research findings. This study underscores the urgent need for cultural and technical shifts to establish reproducibility as a cornerstone of rigorous, accountable, and impactful archaeological science.
In their paper celebrating the 40th anniversary of this journal, Torrence et al. (2015) noted that reproducibility was an issue important to the reputation and sustainability of the discipline, and necessary for archaeological science to behave like a science. As part of the celebration of the 50th anniversary, and of Torrence’s leadership of the journal, my contribution revisits these topics of archaeology’s status as a science, this journal’s place in the landscape of archaeological science, and how the journal has responded to a growing recognition of the importance of reproducibility. I first present bibliometric evidence of the position of archaeology as a whole, and this journal in particular, in the sciences. Next, I report on the journal’s progress in supporting reproducible research, and my work doing a new kind of peer review for JAS, one that evaluates the computational reproducibility of the research submitted for publication. Finally, I analyse twelve months of reproducibility reviews to identify common weaknesses in the ways archaeologists are currently working, and provide simple recommendations for researchers to overcome these and contribute to the improvement of computational reproducibility in archaeological science.
The question of archaeology’s status as a science usually comes up in the context of what the discipline should or should not be. One of the first landmarks in tackling this question is the debate published by Antiquity between classical archaeologist Jacquetta Hawkes and palaeoanthropologist Glynn Isaac. Hawkes (1968), advocating a humanistic archaeology, was concerned that scientific approaches to archaeology were causing researchers to be “swamped by a vast accumulation of insignificant, disparate facts, like a terrible tide of mud, quite beyond the capacity of any man to contain and mould into historical form”. More optimistic about the integration of science and archaeology, Isaac (1971) countered that “New levels of precision in presenting data and in interpreting them can surely lead to briefer and more interesting technical reports as well as providing the basis for more lively literary portrayals of what happened in prehistory.” In a similar vein, Binford (1962) argued that archaeology should operate as a science after the model proposed by philosopher Carl Hempel, which prescribed hypothesis-driven approaches leading to generalizable laws of human behavior. Drawing on a different group of philosophers, Smith (2017) argues for archaeology more specifically as a social science. Bevan (2015) proposes that floods of digital data are reconfiguring our analytical agendas and support empirical and inductive inference. Counter-arguments to archaeology as a science come from numerous directions, notably Hodder (1985), who rejected the quest for generalisations and instead argued that archaeology should be subjective and reflective, focussed on symbolic and relational meanings of material culture and the historical particularity of past human cultures.
These debates, and the many more similar ones summarised by Martinón-Torres and Killick (2013), have become a genre in archaeological writing that can be characterized as mostly based on personal observations, microscopic dissections of a handful of cherry-picked case studies of good or bad practice, and discussion of various philosophers and sociologists.
What has been missing from these debates is a macroscopic observation of what the majority of archaeologists are actually doing, and an empirical comparison to a broad spectrum of relatively harder and softer disciplines. At the ‘hard’ end of the spectrum (e.g. physics and chemistry), scholars more typically share a large established set of theories, facts, and methods, facilitating fairly rapid agreement on the validity and significance of new results (Biglan, 1973a). At the ‘soft’ end of the spectrum (e.g. economics and psychology), the set of theories, facts, and methods on which there is widespread consensus is smaller, and agreement is slower and less frequently reached about the significance of new findings and the continuing relevance of previous work. In sum, hard-soft status is defined by the amount of consensus in a field, and the speed at which consensus is reached on new knowledge (Fanelli and Glänzel, 2013). The hard-soft distinction is controversial, in part because it is sometimes used to imply a rank order of disciplines that encodes legitimacy, productivity, perceived value to society, and worthiness of funding (Cole, 1983; Editors, 2012). Another criticism is that hardness may be more an emergent product of social and institutional processes than a matter of intrinsic differences in method or consensus (Latour, 1987). On the other hand, analyzing the characteristics that lead to the hard-soft distinction can be useful for understanding the diversity of academic inquiry, such as how different fields approach knowledge and differences in what counts as evidence and modes of argument, where fruitful collaborations might be possible due to shared methods and assumptions, and for curriculum design to structure courses appropriately based on a field’s typical ways of knowing (Becher and Trowler, 2001).
Independent of these value judgments, empirical analysis of scholarly articles does support the hard-soft concept as a spectrum of variation in practice linked to differing degrees of consensus in a discipline, for example in approaches to data visualisation (Cleveland, 1984; Smith et al., 2000). Similarly, the frequency of positive results (i.e. full or partial support for a research hypothesis) in publications is significantly correlated with hardness, consistent with a model where researchers in harder fields more readily accept any result their research produces, while those in softer fields have more freedom to choose which theories and hypotheses to test and how to interpret results (Fanelli, 2010). The hard-soft spectrum is also evident in surveys of how researchers view their own work relative to those in other fields (Biglan, 1973b).
To objectively quantify the diversity of modern archaeological practice across a scale of relative hardness or softness, as an evaluation of its status as a science, and the place of this journal in context of other archaeology journals, I take a bibliometric approach. This approach is based on Fanelli and Glänzel (2013), who examined the hardness and softness of 12 disciplines using scholarly publication parameters. Fanelli and Glänzel (2013) found a spectrum of statistically significant variation in bibliometric variables from the physical to the social sciences, with papers at the softer end of the spectrum tending to have fewer co-authors, use less substantive titles, have longer texts, cite older literature, and have a higher diversity of sources. In Fanelli and Glänzel’s (2013) analysis harder sciences include Space Science, Physics, Chemistry, softer sciences include social sciences (Psychiatry, Psychology, Economics, Business, and General Social Sciences), and the Humanities define the soft end of the spectrum. Following Fanelli and Glänzel (2013), I quantify the number of authors, length of article, relative title length, age of references, and diversity of references for a large sample of peer-reviewed journal articles.
These parameters are useful because of how they signify consensus in a research community. A larger number of authors on a paper reflects collaboration of people working towards a common goal. Collaborators take on specialized roles, each studying a part of the problem with high accuracy and detail, and harder fields tend to have larger groups of collaborators (Zuckerman and Merton, 1972). Reflecting this collaboration group size, harder disciplines tend to have higher average numbers of authors on papers. Article length has an inverse correlation with field hardness: in low-consensus, or softer, fields, papers must be longer to present justification, nuance, and contextualization of results. While article length is constrained by journal requirements, leaving individual authors with little freedom to vary, journal requirements are typically set by editors who are professional archaeologists keen to tailor their journal to be attractive and relevant to other members of the discipline. Thus journal requirements for article length will reflect the norms of the discipline at any given time. The number of substantive and informative words in an article’s title, relative to the length of the article, tends to be higher in harder disciplines (Yitzhaki, 1997; Yitzhaki, 2002), reflecting a focus on empiricism and efficiency that is characteristic of high-consensus fields. While Yitzhaki (2002) removed stop-words (e.g. prepositions, articles, conjunctions) to calculate title length, to generate results comparable with Fanelli and Glänzel (2013) I follow their method of dividing the total word count of the article title by the total number of pages of the article to compute relative title length.
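As a minimal sketch of this calculation (the function name and the assumption that titles are whitespace-separated strings are mine, not from the analysis code below):

```r
# Relative title length for an article: total words in the title divided
# by the number of pages, on the natural log scale as in Fanelli & Glänzel.
relative_title_length <- function(title, pages_n) {
  title_n <- lengths(strsplit(title, "\\s+"))  # count words in each title
  log(title_n / pages_n)
}
```

For example, a four-word title on a two-page article gives a relative title length of ln(4/2) ≈ 0.69.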
The age of works cited has long been used as a measure of a field’s hardness (Börner, 2010; Moed et al., 1998), based on the assumption that harder fields assimilate new results more rapidly than softer fields (Price, 1970). I calculated a recency of references index for each article (also known as the Price index), which is the proportion of all cited works that were published in the five years preceding the paper. The diversity of references is a similar indicator: papers in harder fields have a higher concentration of more specific citations because more knowledge is taken for granted as core knowledge that does not need citing (Skilton, 2006). Conversely, softer fields have less knowledge taken for granted, a smaller core of facts that do not need citing, and thus a higher diversity of citations.
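The Price index can be sketched as a one-line function (the function name is mine; the citation window, from five years before publication up to and including the publication year, matches the loop used later in the analysis):

```r
# Price index: proportion of cited works published in the five years
# up to and including the citing article's publication year.
price_index <- function(ref_years, pub_year) {
  sum(ref_years %in% seq(pub_year - 5, pub_year)) / length(ref_years)
}
```

An article published in 2012 citing works from 2012, 2010, and 1995 has a Price index of 2/3.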
library(tidyverse)
# these data were prepared using the code in 000-import-raw-data.R
items_df <-
  read_rds(here::here("analysis/data/wos-data-df.rds"))
n_articles <- nrow(items_df)
year_max <- max(items_df$year)
year_min <- min(items_df$year)
items_df_2012 <-
  items_df %>%
  filter(year == 2012)
# how many archaeology articles in 2012?
n_items_df_2012 <- nrow(items_df_2012)
# how many after 2012?
n_items_df_after_2012 <-
  items_df %>%
  filter(year %in% 2013:year_max) %>%
  nrow()
# what proportion of archaeology articles were published after 2012?
prop_pub_after2012 <- n_items_df_after_2012 / n_articles
# how many distinct journals?
n_journals <- n_distinct(items_df$journal)
While Fanelli and Glänzel (2013) analysed papers published in a single year (2012), I found only 303 papers for that same year, with 70% of the papers in my sample published after that date. To make efficient use of the available data and to ensure robust representation of different areas of archaeology, including those with lower frequencies of journal article publication, I analysed 9697 papers published during 1975-2025. This sample was collected from Clarivate’s Web of Science database by first selecting the Web of Science category ‘Archaeology’ and the Document type ‘article’ (n = 28,871). To focus on journals of broad relevance to most archaeologists, and that are representative of substantial communities of practice, I then filtered the results to keep only articles published in the top-ranking 25 journals according to their 2022 Impact Factor as reported by Clarivate’s Journal Citation Indicator. Finally, I excluded journals with fewer than 100 articles in the database, resulting in 20 journals.
The entire R code (R Core Team, 2024) used for all the analysis and visualizations contained in this paper is at https://doi.org/10.5281/zenodo.14897252 to enable re-use of materials and improve reproducibility and transparency (Marwick, 2017). All the figures, tables, and statistical test results presented here can be independently reproduced with the code and data in this compendium (Marwick et al., 2018). The R code is released under the MIT license, the data as CC-0, and figures as CC-BY, to enable maximum re-use.
library(ggrepel)
source(here::here("analysis/code/001-redraw-Fanelli-and-Glanzel-Fig-2.R"))
base_size <- 6
color <- c('#d95f02', '#7570b3', '#1b9e77')
alpha <- 0.2
linewidth <- 0.1
# Number of authors ------------------
boxlplot_n_authors <-
ggplot() +
# boxplot of data from this study
geom_boxplot(data = items_df %>%
filter(!is.na(year)),
aes(1, log(authors_n)),
size = 1) +
# boxplot of data from Fanelli & Glänzel Fig 2
geom_boxplot(data = sim_data %>%
filter(Variable == "N. of authors (ln)",
Category %in% c("h", "p", "s")),
aes(1, Value,
group = Category),
size = 1,
fill = color,
colour = color,
alpha = alpha,
linewidth = linewidth) +
# annotations from Fanelli & Glänzel Fig 2
geom_text_repel(data = sim_data %>%
filter(Variable == "N. of authors (ln)",
Category %in% c("h", "p", "s")) %>%
group_by(Category) %>%
summarise(y = median(Value)) %>%
mutate(
label = as.character(Category)
),
aes(c(0.75, 1, 1.25), y, label = label),
color = color,
bg.colour = "white",
bg.r = .2,
force = 0) +
scale_y_continuous(limits = c(0, 5)) +
scale_x_continuous(labels = NULL) +
theme_minimal(base_size = base_size) +
theme(panel.grid = element_blank()) +
ylab("N. of authors (ln)") +
xlab("Collaborator group size")
# Relative title length ----------------
items_df_title <-
items_df %>%
filter(!is.na(pages_n)) %>%
filter(!is.na(title_n)) %>%
mutate(relative_title_length = log(title_n / pages_n))
boxlplot_rel_title_length <-
items_df_title %>%
filter(!is.na(year)) %>%
ggplot(aes(1,
relative_title_length)) +
geom_boxplot(
size = 1) +
# boxplot of data from Fanelli & Glänzel Fig 2
geom_boxplot(data = sim_data %>%
filter(Variable == "Relative title length (ln)",
Category %in% c("h", "p", "s")),
aes(1, Value,
group = Category),
size = 1,
fill = color,
colour = color,
alpha = alpha,
linewidth = linewidth) +
# annotations from Fanelli & Glänzel Fig 2
geom_text_repel(data = sim_data %>%
filter(Variable == "Relative title length (ln)",
Category %in% c("h", "p", "s")) %>%
group_by(Category) %>%
summarise(y = median(Value)) %>%
mutate(
label = as.character(Category)
),
aes(c(0.75, 1, 1.25), y, label = label),
bg.colour = "white",
colour = color,
bg.r = .2,
force = 0) +
scale_y_continuous(limits = c(-4.5, 3),
breaks = seq(-5, 5, 1),
labels = seq(-5, 5, 1)) +
scale_x_continuous(labels = NULL) +
theme_minimal(base_size = base_size) +
theme(panel.grid = element_blank()) +
ylab("Ratio of title length to article length (ln)") +
xlab("Relative title length")
# Number of pages ------------------
boxlplot_n_pages <-
items_df %>%
ggplot(aes(1,
log(pages_n))) +
geom_boxplot(
size = 1) +
# boxplot of data from Fanelli & Glänzel Fig 2
geom_boxplot(data = sim_data %>%
filter(Variable == "N. of pages (ln)",
Category %in% c("h", "p", "s")),
aes(1, Value,
group = Category),
size = 1,
fill = color,
colour = color,
alpha = alpha,
linewidth = linewidth) +
# annotations from Fanelli & Glänzel Fig 2
geom_text_repel(data = sim_data %>%
filter(Variable == "N. of pages (ln)",
Category %in% c("h", "p", "s")) %>%
group_by(Category) %>%
summarise(y = median(Value)) %>%
mutate(
label = as.character(Category)
),
aes(c(0.75, 1, 1.25), y, label = label),
bg.colour = "white",
colour = color,
bg.r = .2,
force = 0) +
scale_y_reverse(limits = c(5, 0)) +
scale_x_continuous(labels = NULL) +
theme_minimal(base_size = base_size) +
theme(panel.grid = element_blank()) +
ylab("N. of pages (ln)") +
xlab("Article length")
# Price's index - age of references ------------------
library(stringr)
# output storage
prices_index <- vector("numeric", length = nrow(items_df))
# loop over articles, this takes a moment
for (i in seq_len(nrow(items_df))) {
  refs <- items_df$refs[i]
  year <- items_df$year[i]
  # extract the four-digit publication year of each cited work
  ref_years <-
    as.numeric(str_match(str_extract_all(refs, ", [0-9]{4}, ")[[1]], "\\d{4}"))
  preceding_five_years <- seq(year - 5, year, 1)
  refs_in_preceding_five_years <-
    ref_years[ref_years %in% preceding_five_years]
  prices_index[i] <-
    length(refs_in_preceding_five_years) / length(ref_years)
}
# add to data frame
items_df$prices_index <- prices_index
# plot
boxlplot_price_index <-
items_df %>%
ggplot(aes(1,
prices_index)) +
geom_boxplot(
size = 1) +
# boxplot of data from Fanelli & Glänzel Fig 2
geom_boxplot(data = sim_data %>%
filter(Variable == "Price's index",
Category %in% c("h", "p", "s")),
aes(1, Value,
group = Category),
size = 1,
fill = color,
colour = color,
alpha = alpha,
linewidth = linewidth) +
# annotations from Fanelli & Glänzel Fig 2
geom_text_repel(data = sim_data %>%
filter(Variable == "Price's index",
Category %in% c("h", "p", "s")) %>%
group_by(Category) %>%
summarise(y = median(Value)) %>%
mutate(
label = as.character(Category)
),
aes(c(0.75, 1, 1.25), y, label = label),
bg.colour = "white", colour = color,
bg.r = .2,
force = 0) +
scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(labels = NULL) +
theme_minimal(base_size = base_size) +
theme(panel.grid = element_blank()) +
ylab("Prop. refs in last 5 years") +
xlab("Recency of references")
# Shannon index - diversity of references ------------------
# journal name as species, article as habitat
# simplify the refs, since they are a bit inconsistent, some of
# these steps take a few seconds
ref_list1 <- map(items_df$refs, ~tolower(.x))
ref_list2 <- map(ref_list1, ~str_replace_all(.x, "\\.|,| ", ""))
ref_list3 <- map(ref_list2, ~str_split(.x, "\n"))
ref_list4 <- map(ref_list3, ~tibble(x = .x))
ref_list5 <- bind_rows(ref_list4, .id = "id")
ref_list6 <- unnest(ref_list5, cols = c(x))
# get the journal names out of the refs
ref_list7 <-
ref_list6 %>%
mutate(journal_name = gsub("\\-", "", x)) %>%
mutate(journal_name = gsub("\\:", "", journal_name)) %>%
mutate(journal_name = gsub("^[a-z'\\(\\)\\:]+[0-9]{4}", "", journal_name)) %>%
mutate(journal_name = gsub("v[0-9]+.*", "", journal_name)) %>%
mutate(journal_name = gsub("p[0-9]+$", "", journal_name))
# prepare to compute shannon and join with other variables
items_df$id <- 1:nrow(items_df)
# tally of all referenced items
all_cited_items <-
ref_list7 %>%
select(x) %>%
group_by(x) %>%
tally() %>%
arrange(desc(n))
# get a list of the top journals
top_journals <-
ref_list7 %>%
select(journal_name) %>%
group_by(journal_name) %>%
tally() %>%
filter(n > 50) %>%
arrange(desc(n))
# In the Shannon index, p_i = n/N is the proportion of citations to one
# particular journal ("species"), i.e. the number of citations to that
# journal (n) divided by the total number of citations in the article (N);
# ln is the natural log, and the index sums -p_i * ln(p_i) over all s
# journals cited by the article ("habitat").
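# Illustrative helper (a sketch, not called in the pipeline below): the
# same Shannon calculation for a single article, given a character vector
# of cited journal names.
shannon_index <- function(journals) {
  p_i <- table(journals) / length(journals)  # proportion of each journal
  -sum(p_i * log(p_i))
}
# e.g. an article citing two journals equally often has index ln(2)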
# compute diversity of all citations for each article (habitat)
shannon_per_item <-
ref_list7 %>%
group_by(id, journal_name) %>%
tally() %>%
group_by(id) %>%
mutate(p_i = n / sum(n, na.rm = TRUE)) %>%
mutate(p_i_ln = log(p_i)) %>%
group_by(id) %>%
summarise(shannon = -sum(p_i * p_i_ln, na.rm = TRUE)) %>%
mutate(id = as.numeric(id)) %>%
arrange(id) %>%
left_join(items_df)
# plot
boxlplot_shannon_index <-
shannon_per_item %>%
filter(!is.na(year)) %>%
ggplot(aes(1,
shannon)) +
geom_boxplot(
size = 1) +
# boxplot of data from Fanelli & Glänzel Fig 2
geom_boxplot(data = sim_data %>%
filter(Variable == "Shannon div. of sources",
Category %in% c("h", "p", "s")),
aes(1, Value,
group = Category),
size = 1,
fill = color,
colour = color,
alpha = alpha,
linewidth = linewidth) +
# annotations from Fanelli & Glänzel Fig 2
geom_text_repel(data = sim_data %>%
filter(Variable == "Shannon div. of sources",
Category %in% c("h", "p", "s")) %>%
group_by(Category) %>%
summarise(y = median(Value)) %>%
mutate(
label = as.character(Category)
),
aes(c(0.75, 1, 1.25), y, label = label),
colour = color,
bg.colour = "white",
bg.r = .2,
force = 0) +
scale_y_reverse(limits = c(6, 0)) +
scale_x_continuous(labels = NULL) +
theme_minimal(base_size = base_size) +
theme(panel.grid = element_blank()) +
ylab("Shannon Index") +
xlab("Diversity of references")
library(ggplot2)
library(tools)
library(stringr)
library(purrr)
library(cowplot)
plot_grid(boxlplot_n_authors,
boxlplot_rel_title_length,
boxlplot_n_pages,
boxlplot_price_index,
boxlplot_shannon_index,
nrow = 2)
Distributions of article characteristics hypothesised to reflect the level of consensus. The boxplot shows the distribution of values of archaeology articles. The thick black line in the middle of the boxplot is the median value, the box represents the inter-quartile range (the range between the 25th and 75th percentiles, where 50% of the data are located), and individual points represent outliers. The smaller coloured boxplots indicate the values computed by Fanelli and Glänzel (2013), where p = physics, s = social sciences, h = humanities. ln denotes the natural logarithm, or logarithm to the base e.
(fig-compare-other-fields?) shows the distribution of bibliometric variables for archaeology in the context of data from other fields presented by Fanelli and Glänzel (2013). The most striking indicator of archaeology as a hard science is the number of authors, where it is between the social sciences and physics. Archaeology is a close fit with the social sciences in relative title length. It is between the social sciences and humanities in recency of references and diversity of references. The clearest indicator of archaeology as a soft science is article length where it is similar to the humanities. Overall, archaeology does not sit squarely at either end of the hard-soft spectrum. It is generally not a harder science than the social sciences, with the exception of collaborator group sizes.
over_time <-
items_df %>%
left_join(items_df_title) %>%
left_join(shannon_per_item) %>%
filter(relative_title_length != -Inf,
relative_title_length != Inf,
prices_index != "NaN"
) %>%
mutate(log_authors_n = log(authors_n),
log_pages_n = log(pages_n),
journal_wrp = str_wrap(journal, 30)) %>%
select(year,
log_authors_n,
log_pages_n,
prices_index,
shannon,
relative_title_length)
over_time_long <-
over_time %>%
ungroup() %>%
select(-journal) %>%
gather(variable,
value,
-year) %>%
filter(value != -Inf,
value != Inf) %>%
mutate(variable = case_when(
variable == "log_authors_n" ~ "N. of authors (ln)",
variable == "log_pages_n" ~ "N. of pages (ln)",
variable == "prices_index" ~ "Recency of references",
variable == "shannon" ~ "Diversity of references",
variable == "relative_title_length" ~ "Relative title length (ln)"
)) %>%
filter(!is.na(variable)) %>%
filter(!is.nan(value)) %>%
filter(!is.na(value)) %>%
filter(value != "NaN") %>%
mutate(value = parse_number(value))
# compute beta estimates so we can colour lines to indicate more or
# less hard
library(broom)
over_time_long_models <-
over_time_long %>%
group_nest(variable) %>%
mutate(model = map(data, ~tidy(lm(value ~ year, data = .)))) %>%
unnest(model) %>%
filter(term == 'year') %>%
mutate(becoming_more_scientific = case_when(
variable == "N. of authors (ln)" & estimate > 0 ~ "TRUE",
variable == "N. of pages (ln)" & estimate < 0 ~ "TRUE",
variable == "N. of refs (sqrt)" & estimate < 0 ~ "TRUE",
variable == "Recency of references" & estimate > 0 ~ "TRUE",
variable == "Relative title length (ln)" & estimate > 0 ~ "TRUE",
variable == "Diversity of references" & estimate < 0 ~ "TRUE",
TRUE ~ "FALSE"
))
# join with data
over_time_long_colour <-
over_time_long %>%
left_join(over_time_long_models)
library(ggpmisc)
library(mgcv)
formula <- y ~ x
over_time_long_colour_gams <-
over_time_long_colour %>%
nest(.by = variable) %>%
mutate(mod_gam = lapply(data,
function(df) gam(year ~ s(value, bs = "cr"),
data = df)))
over_time_long_colour_gams_summary <-
over_time_long_colour %>%
nest(.by = variable) %>%
mutate(fit = map(data, ~mgcv::gam(year ~ s(value, bs = "cs"), data = .)),
results = map(fit, glance),
R.square = map_dbl(fit, ~ summary(.)$r.sq)) %>%
unnest(results) %>%
select(-data, -fit) %>%
select(variable, adj.r.squared)
over_time_long_colour_gams_summary_df <-
over_time_long_colour %>%
left_join(over_time_long_colour_gams_summary)
## Joining with `by = join_by(variable)`
plot <- ggplot() +
geom_point(data = over_time_long_colour_gams_summary_df,
aes(year,
value,
colour = becoming_more_scientific),
alpha = 0.5) +
geom_smooth(data = over_time_long_colour_gams_summary_df,
aes(year, value),
method="gam",
formula = y ~ s(x, bs = "cs"),
se = FALSE,
linewidth = 2,
colour = "#7570b3") +
facet_wrap( ~ variable,
scales = "free_y") +
theme_bw(base_size = base_size) +
scale_color_manual(values = c("#d95f02",
"#1b9e77" )) +
guides(colour = "none") +
ylab("") +
geom_text(data = over_time_long_colour_gams_summary_df %>%
group_by(variable) %>%
summarise(max_value = max(value),
adj.r.squared = unique(adj.r.squared)),
aes(
x = 1980,
y = max_value,
label = paste("Pseudo R² = ",
signif(adj.r.squared,
digits = 3))),
hjust = 0,
vjust = 1.5,
size = 2)
# get this from the supplement on GAMS
knitr::include_graphics(here::here("analysis/figures/fig-smooth-plots-paper.png"))
(fig-change-over-time?) shows how the bibliometric indicators of field hardness have changed over time for archaeology articles. By two measures, the number of authors and relative title length, archaeology has become harder over time. On the other hand, three metrics indicate that archaeology has become softer: diversity of references, article length, and recency of references. Although all the relationships are statistically significant, these temporal trends are generally very weak, with low slope values indicating very slow change over time. Similarly, the r-squared values are very low, demonstrating that much of the variability in these metrics is independent of time.
The most striking change over time is in the increase in the number of authors, which has the highest r-squared value of these metrics. One interesting detail evident in (fig-change-over-time?) is the increase in the range of diversity of references after about 2010. This may be due to some broader changes in academic publishing around this time, such as moves to digital-first continuous publishing, new journals appearing (e.g. Archaeological and Anthropological Sciences in 2009 and Journal of Island & Coastal Archaeology in 2010), and non-archaeology journals becoming more relevant to archaeologists. For example, PLOS ONE received its first impact factor in 2010 and in 2011 Nature’s Scientific Reports began publishing (Malashichev, 2017). The appearance of Google Scholar in 2004, increasing the discoverability of many works for many researchers, may have also contributed to this increase in diversity of references.
journal_title_size <- 2
# get rank order of journals by these bibliometic variables
journal_metrics_for_plotting <-
items_df %>%
left_join(items_df_title) %>%
left_join(shannon_per_item) %>%
ungroup() %>%
select(journal,
authors_n, # log
pages_n, # log
relative_title_length,
prices_index,
shannon
) %>%
filter(relative_title_length != -Inf,
relative_title_length != Inf,
prices_index != "NaN"
) %>%
mutate(
log_authors = log(authors_n),
log_pages = log(pages_n)
)
journal_metrics_for_plotting_summary <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
group_by(journal) %>%
summarise(mean_log_authors = mean(log_authors),
mean_log_pages = mean(log_pages),
mean_relative_title_length = mean(relative_title_length),
mean_prices_index = mean(prices_index),
mean_shannon = mean(shannon))
# PCA of journal means
journal_metrics_for_plotting_summary_pca <-
journal_metrics_for_plotting_summary %>%
column_to_rownames("journal") %>%
prcomp(scale = TRUE)
# Tidy the PCA results
pca_means_tidy <- journal_metrics_for_plotting_summary_pca %>% tidy(matrix = "pcs")
# first two PCs explain how much?
# Get the summary of the PCA
pca_summary <- summary(journal_metrics_for_plotting_summary_pca)
# Extract the proportion of variance explained by PC1 and PC2
variance_explained <- round(pca_summary$importance[2, 1:2] * 100, 0)
# Get the PCA scores
pca_scores_means <- journal_metrics_for_plotting_summary_pca %>% augment(journal_metrics_for_plotting_summary)
# Get the PCA loadings
pca_loadings_means <-
journal_metrics_for_plotting_summary_pca %>%
tidy(matrix = "rotation") %>%
pivot_wider(names_from = "PC",
values_from = "value",
names_prefix = "PC") %>%
mutate(column = case_when(
column == "mean_log_authors" ~ "Number of\nauthors",
column == "mean_log_pages" ~ "Number of\npages",
column == "mean_relative_title_length" ~ "Relative\ntitle\nlength",
column == "mean_prices_index" ~ "Recency of\nreferences",
column == "mean_shannon" ~ "Diversity of\nreferences",
))
# Plot the PCA results
plot_pca_means <-
ggplot() +
labs(x = paste0("PC1 (", variance_explained[1], "%)"),
y = paste0("PC2 (", variance_explained[2], "%)")) +
geom_point(data = pca_scores_means,
aes(.fittedPC1,
.fittedPC2),
size = 1) +
geom_text_repel(data = pca_scores_means %>%
mutate(label = str_replace(journal,
"JOURNAL",
"J.")) %>%
mutate(label = str_remove(label,
"-AN\nINTERNATIONAL\nJ.")),
aes(.fittedPC1,
.fittedPC2,
label = label),
lineheight = 0.8,
segment.color = NA,
force_pull = 10,
size = 2.5,
bg.color = "white", # Color of the halo
bg.r = 0.2) +
geom_segment(data = pca_loadings_means,
aes(x = 0,
y = 0,
xend = PC1,
yend = PC2),
arrow = arrow(length = unit(0.2, "cm")),
color = "grey70") +
geom_text_repel(data = pca_loadings_means,
aes(PC1,
PC2,
label = column),
size = 2,
lineheight = 0.8,
force = 10,
force_pull = 0,
segment.color = NA,
color = "grey40",
bg.color = "white", # Color of the halo
bg.r = 0.2) +
theme_minimal(base_size = base_size) +
coord_fixed(xlim = c(-6, 2.5),
ylim = c(-3, 2))
# tricky to get the label spacing right, let's save an SVG, edit
# by hand, then export to PNG and read that file later.
ggsave(plot_pca_means,
filename = here::here("analysis/figures/plot_pca_means.svg"))
# looking into rankings of the journals
journal_summary_metrics_ranks <-
journal_metrics_for_plotting_summary %>%
mutate(across(starts_with("mean"),
~ rank(-.),
.names = "rank_{.col}")) %>%
select(journal, starts_with("rank")) %>%
# reorder by hardness
mutate(rank_mean_log_pages = 21 - rank_mean_log_pages,
rank_mean_shannon = 21 - rank_mean_shannon)
library(irr)
journal_summary_metrics_ranks_test <-
journal_summary_metrics_ranks %>%
select(-journal) %>%
kendall(correct = TRUE)
# Convert to scientific text
pretty_print_sci <- function(num){
scientific_text <- paste0(gsub("e", " x 10^", # Replace 'e' with ' x 10^'
sprintf("%.2e", num)), "^") # 2 decimal places; trailing ^ closes the markdown superscript
return(scientific_text)
}
borda_count_tbl <- function(votes_tbl) {
# Number of ranking variables ('voters')
num_voters <- ncol(votes_tbl) - 1
# Calculate a score for each option: sum over voters of (num_voters - rank).
# Using num_voters rather than the number of candidates offsets every score
# by the same constant per voter, so the resulting order matches a classic
# Borda Count.
scores <- votes_tbl %>%
rowwise() %>%
mutate(Score = sum(num_voters - c_across(starts_with("rank_")))) %>%
ungroup() %>%
select(1, Score)
# Return scores
return(scores)
}
# Calculate Borda Count scores
borda_scores <-
journal_summary_metrics_ranks %>%
borda_count_tbl() %>%
rename("Journal" = "journal") %>%
arrange(desc(Score))
plot_borda_scores <-
borda_scores %>%
mutate(Journal = str_wrap(Journal, 20)) %>%
ggplot() +
aes(reorder(Journal, Score),
Score) +
geom_col() +
coord_flip() +
ylab("Borda Count scores") +
xlab("") +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size))
library(ggridges)
plot_journals_authors <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
ggplot(aes(y = reorder(journal,
log_authors,
FUN = mean),
x = log_authors,
fill = after_stat(x),
height = after_stat(density))) +
geom_density_ridges_gradient(stat = "density",
colour = "white") +
scale_fill_viridis_c() +
guides(fill = 'none') +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size)) +
ylab("") +
xlab("Number of authors (ln)")
plot_journals_article_length <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
ggplot(aes(y = reorder(journal,
-log_pages,
FUN = mean),
x = log_pages,
fill = after_stat(x),
height = after_stat(density))) +
geom_density_ridges_gradient(stat = "density",
colour = "white") +
scale_fill_viridis_c() +
guides(fill = 'none') +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size)) +
xlab("Number of pages (ln)") +
ylab("")
plot_journals_title_length <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
ggplot(aes(y = reorder(journal,
relative_title_length,
FUN = mean),
x = relative_title_length,
fill = after_stat(x),
height = after_stat(density))) +
geom_density_ridges_gradient(stat = "density",
colour = "white") +
scale_fill_viridis_c() +
guides(fill = 'none') +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size)) +
ylab("") +
xlab("Relative title length (ln)")
plot_journals_ref_recency <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
group_by(journal) %>%
ggplot(aes(y = reorder(journal,
prices_index,
FUN = mean),
x = prices_index,
fill = after_stat(x),
height = after_stat(density))) +
geom_density_ridges_gradient(stat = "density",
colour = "white") +
scale_fill_viridis_c() +
guides(fill = 'none') +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size)) +
ylab("") +
xlab("Recency of references")
plot_journals_ref_diversity <-
journal_metrics_for_plotting %>%
mutate(journal = str_wrap(journal, 20)) %>%
group_by(journal) %>%
ggplot(aes(y = reorder(journal,
-shannon,
FUN = mean),
x = shannon,
fill = after_stat(x),
height = after_stat(density))) +
geom_density_ridges_gradient(stat = "density",
colour = "white") +
scale_fill_viridis_c() +
guides(fill = 'none') +
theme_minimal(base_size = base_size) +
theme(axis.text.y = element_text(size = journal_title_size)) +
ylab("") +
xlab("Diversity of references")
library(cowplot)
plot_grid(plot_journals_authors,
plot_journals_article_length,
plot_journals_title_length,
plot_journals_ref_recency,
plot_journals_ref_diversity,
plot_borda_scores,
nrow = 2,
labels = LETTERS[1:6],
label_size = 6)
Panels A-E: Variation in bibliometric indicators of hardness for 20 archaeological journals. The journals are ordered for each indicator so that within each plot, the harder journals are at the top of the plot and the softer journals are at the base. Panel F shows a bar plot that is the single consensus ranking computed from all five variables, using the Borda Count ranking algorithm.
knitr::include_graphics(here::here("analysis/figures/plot_pca_means.png"))
(fig-variation-by-journal?) shows the distribution of our bibliometric variables of hardness for each of the 20 journals in the sample. Overall agreement between these bibliometric variables in ranking these journals on a hard-soft spectrum is moderate to strong, with a Kendall’s coefficient of concordance (Wt) value of 0.7 (in a 0-1 range, where 1 is perfect agreement) and a p-value of 4.08 × 10⁻⁷. Panel F of (fig-variation-by-journal?) shows an overall consensus ranking of all journals in the sample. In this consensus ranking, the Journal of Archaeological Science is among the top five archaeology journals for hardness. It is placed at the harder end of the hard-soft spectrum especially by the number of pages and relative title length, and to lesser degrees by the number of authors and recency of references. However, according to the diversity of references, the Journal of Archaeological Science is in the middle of the spectrum.
The Journal of Cultural Heritage is the only journal that consistently ranks as hard across all variables, occurring in the top five journals for all five metrics. This journal primarily publishes materials science and computational analyses related to conservation and preservation of historic objects in museums and other collections. Authors of papers in recent issues have affiliations with museums, cultural heritage programs, and chemistry, engineering, and physics departments at European and Chinese universities. Notably, papers in this journal typically do not engage in questions or debates about past human behaviour or culture. The absence of these questions in research published in this journal makes it an outlier here, since these questions are central to a common definition of archaeology as ‘cultural anthropology of the past’, a phrase first found in Leroi-Gourhan (1946) and repeated in widely-used contemporary undergraduate textbooks such as Renfrew et al. (2024). Most archaeologists would likely be surprised at the decision by Clarivate to include the Journal of Cultural Heritage in their category of archaeology journals, a decision that leads to the result in (fig-variation-by-journal?) where the hardest archaeology journal publishes papers that are not very archaeological, because they do not engage with anthropological topics.
The Journal of Archaeological Research is notable for consistently ranking as soft; it was the softest journal for four of our five bibliometric variables. This is a predictable result for a review journal, a distinct type of journal dedicated to summarizing, analyzing, and synthesizing existing research in a particular field. The stated aim of the Journal of Archaeological Research is to ‘bring together the most recent international research summaries on a broad range of topics and geographical areas’ (Feinman and Parkinson, 2024). A typical article is a long single-authored synthesis of archaeology in a region or on a topic. As the only review journal in this sample, it stands in stark contrast to the other journals here, which present original research findings, and, like the Journal of Cultural Heritage, it may be considered an outlier in this sample.
The PCA results in (fig-pca?) show that PC1 captures most of the variance in the metrics (72%) and is a reasonable proxy for the hard-soft spectrum, with the Journal of Cultural Heritage representing the hard extreme on the right and the Journal of Archaeological Research representing the soft extreme on the left. The variables that contribute most to variation in PC1 are title length, number of pages, and the diversity of references. Journals with higher PC1 values have articles with longer titles, fewer pages, and less diverse reference lists. The distribution of PC1 values is skewed left, with most of the journals concentrated at the harder end of the spectrum. Variation in PC2 is influenced by the number of authors and the recency of references. The distribution of PC2 values reveals additional structure in the data: in the negative range of the PC2 axis are generalist journals (e.g. American Antiquity, Antiquity, Advances in Archaeological Practice), characterised by fewer authors and more recent references, while in the positive range are more specialised journals (e.g. Environmental Archaeology, Geoarchaeology, Archaeological Research in Asia, Journal of Island and Coastal Archaeology), characterised by larger numbers of authors and less recent references. The Journal of Archaeological Science sits about midway between these two groups, reflecting its relevance to both specialised and generalist communities of practice in archaeology.
This macroscopic perspective derived from an analysis of the ways thousands of archaeologists communicate their research has produced a complex picture of archaeology as a science. In the context of a broad spectrum of other research areas, archaeologists behave like social scientists. We are harder than typical social scientists in tending to form larger groups of collaborators more often, and softer in sometimes writing longer articles that more resemble humanities scholarship. The outlook for the future of archaeology is also complex, with three out of five of the bibliometric variables trending towards more humanistic styles of working, but the discipline showing more extreme values in some metrics towards both hard and soft sciences after about 2010. Among archaeology journals, we see distinct communities of practice reflected in the PCA results that are very close together on the hard-soft spectrum, but have minor differences in their communication styles, perhaps due to cultural differences in writing traditions inherited from parent disciplines such as geology and biology (Becher and Trowler, 2001).
While these bibliometric variables provide several interesting insights into the status of archaeology as a science, via measurement of consensus, and are important for moving the debate beyond discussions of a small number of case studies, they miss a crucial factor that separates scientific practice from non-science. This is reproducibility, which, according to a report for the US National Science Foundation (Cacioppo et al., 2015), “refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results… Reproducibility is a minimum necessary condition for a finding to be believable and informative.” Scientific reproducibility is a factor contributing to the hardness of a field. Specifically, reproducibility is linked to the concept of consensus in a field because if more researchers provide sufficient detail for others to reproduce their results, then consensus on new knowledge can more often, and more rapidly, be established. The importance of this factor can be traced to Irish chemist Robert Boyle (1627-1691), best known for his experiments with vacuum pumps (Shapin and Schaffer, 2011). Boyle was concerned about the secrecy common among experimentalists in the 17th century and aimed to shift the culture from valuing direct in-person witnessing of scientific demonstrations towards meticulous written communications that were detailed enough to enable a reader to successfully undertake the experiment themselves, independent of the original author.
With many disciplines making increasing use of computationally intensive analyses in recent years there has been renewed interest in reproducibility (LeVeque et al., 2012). In part, this is because computationally intensive research is difficult to communicate within the constraints of the methods section of a traditional journal article — the reader also needs the computer code written by the original authors, not just the article text. There is also the broader context of rising pressure to publish in prestigious journals and intense competition for funds that create strong incentives for malpractice in research (Edwards and Roy, 2017). These two factors have led to widespread concerns of a reproducibility crisis in many fields (Baker, 2016). Estimates of scientific reproducibility in several fields confirm the extent of this problem. Empirical replications of 100 studies published in three psychology journals found that 36% of replications had statistically significant results, compared to 96% of the original studies (Open Science Collaboration, 2015). Similar empirical replications of large numbers of social science studies and experimental economics studies successfully replicated 61% and 62% of their target studies respectively (Camerer et al., 2018; Camerer et al., 2016).
Similarly bleak results come from measurements specifically of the reproducibility of computational analyses in scientific studies. An attempt at reproducing the computational results of 204 papers in Science succeeded for 26% of them (Stodden et al., 2018). The computational results of only two of 41 geoscience papers could be fully reproduced on the first attempt (Konkol et al., 2019). In the biomedical field, code in 1,203 out of 27,271 (4%) notebooks associated with 3,467 publications could be run without errors (Samuel and Mietchen, 2024). Statisticians could reproduce 15% of 93 papers (Xiong and Cribben, 2023). Economists have been especially active in researching computational reproducibility, with studies indicating successful reproduction of results using code and data provided by authors for 30% of 67 papers (Chang and Li, 2015), 14% of 203 papers (Gertler et al., 2018), 44% of 152 papers (Herbert et al., 2021), 30% of 419 articles (Fišar et al., 2024), and 28% of 168 papers (Pérignon et al., 2024). These efforts confirm that the reproducibility of published research is widely recognised as a cornerstone of rigorous science, and work on evaluating how successful a research community is at generating reproducible results has become a distinctive and important meta-research activity in many fields.
How does archaeology compare to these other fields in terms of reproducibility? Empirical reproducibility has long been valued in field archaeology. Throughout the history of archaeology, well-known sites have been repeatedly revisited to test old hypotheses with new evidence or methods, for example, Olduvai Gorge (Tanzania), Cahokia (USA), Çatalhöyük (Turkey), and Madjedbebe (Australia). Similarly among many experimental archaeologists, empirical reproducibility is a key concern, for example in lithic use-wear identification (Hayes et al., 2017) and the measurement of lithics (Pargeter et al., 2023). The increasing availability of large digital datasets is pushing archaeology into unexplored areas (Bevan, 2015), inviting questions about what reproducibility means for data intensive archaeological research. For example, to what extent does concern for reproducibility extend to computational reproducibility among archaeologists?
In 2024 the Journal of Archaeological Science introduced a new kind of peer review that has provided an opportunity to tackle this question about computational reproducibility in archaeology. In January 2024 I accepted the position of ‘Associate Editor for Reproducibility’ (AER) for JAS and conducted reproducibility reviews of submissions that mentioned programming languages such as R or Python in the methods sections, taking guidance from similar initiatives in other fields (e.g. Ivimey-Cook et al., 2023; Nüst and Eglen, 2021). A reproducibility review examines the code and data used to generate the results presented in the paper, and attempts to run the authors’ code to reproduce their results (see Editors (n.d.) for more details about this process). This new AER role is based on similar positions (i.e. ‘data editor’ or ‘reproducibility editor’) that journals in economics (Vilhuber, 2019), statistics (Wrobel et al., 2024), astronomy (Muench, 2023), ecology (Bolnick et al., 2022), and environmental studies (Rosenberg et al., 2021) have had, in some cases for over a decade. In 2024, three archaeology journals, in addition to JAS, added AERs to their editorial communities: Advances in Archaeological Practice (Marwick, 2024, one paper reviewed), Journal of Field Archaeology (Farahani, 2024, two papers reviewed), and American Antiquity (Martin, 2024, no papers reviewed at the time of writing).
# reporting on my AER work
# initially collected here:
# https://docs.google.com/spreadsheets/d/1o4jSZ__OoCDssf0lWWeBheNEbgiheU_1Udj5X82VOvY/edit?gid=0#gid=0
jas_rr_data <-
read_csv(here::here("analysis/data/JAS AER data analysis.csv"))
base_size <- 6
# what software did the authors use?
aer_plot_software <-
jas_rr_data %>%
separate_longer_delim(Language, ",") %>%
mutate(Language = str_squish(Language)) %>%
count(Language) %>%
drop_na() %>%
mutate(Language = fct_reorder(Language, n, .desc = FALSE)) %>%
ggplot() +
aes(Language, n) +
geom_bar(stat = 'identity') +
coord_flip() +
xlab("") +
ylab("Number of papers reviewed") +
theme_minimal(base_size = base_size)
# what methods did the authors use? ML includes CNN, VAR, DL
aer_plot_methods <-
jas_rr_data %>%
separate_longer_delim(Methods2, ",") %>%
mutate(Methods2 = str_squish(Methods2)) %>%
count(Methods2) %>%
drop_na() %>%
mutate(Methods = fct_reorder(Methods2, n, .desc = FALSE)) %>%
ggplot() +
aes(Methods, n) +
geom_bar(stat = 'identity') +
coord_flip() +
xlab("") +
ylab("Number of papers reviewed") +
theme_minimal(base_size = base_size)
# where did they put their materials?
aer_plot_repo <-
jas_rr_data %>%
separate_longer_delim(Repo, ",") %>%
mutate(Repo = str_squish(Repo)) %>%
count(Repo) %>%
drop_na() %>%
mutate(Repo = fct_reorder(Repo, n, .desc = FALSE)) %>%
ggplot() +
aes(Repo, n) +
geom_bar(stat = 'identity') +
coord_flip() +
xlab("") +
ylab("Number of papers reviewed") +
theme_minimal(base_size = base_size) +
scale_y_continuous(breaks = scales::breaks_pretty())
# could reproduce the results on the first try?
aer_first_try <-
jas_rr_data %>%
count(`Run on first try?`) %>%
drop_na() %>%
mutate(`Run on first try?` = fct_reorder(`Run on first try?`, n, .desc = FALSE))
# why not? issues with code
aer_plot_issues <-
jas_rr_data %>%
separate_longer_delim(`Code issue`, ",") %>%
mutate(`Code issue` = str_squish(`Code issue`)) %>%
mutate(`Code issue` = ifelse(`Code issue` == "Proprietary software",
"Proprietary\nsoftware",
`Code issue`)) %>%
count(`Code issue`) %>%
drop_na() %>%
mutate(`Code issue` = fct_reorder(`Code issue`, n, .desc = FALSE)) %>%
ggplot() +
aes(`Code issue`, n) +
geom_bar(stat = 'identity') +
coord_flip() +
xlab("") +
ylab("Number of papers reviewed") +
theme_minimal(base_size = base_size) +
scale_y_continuous(breaks = scales::breaks_pretty())
# relationship between code issues and language?
aer_plot_issues_lang <-
jas_rr_data %>%
separate_longer_delim(`Code issue`, ",") %>%
separate_longer_delim(Language, ",") %>%
mutate(Language = str_squish(Language)) %>%
mutate(`Code issue` = str_squish(`Code issue`)) %>%
group_by(`Code issue`,
Language) %>%
tally() %>%
drop_na() %>%
ggplot() +
aes(`Code issue`,
Language,
size = n) +
geom_point() +
xlab("Problem with the code") +
ylab("Software") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
theme_minimal(base_size = base_size) +
scale_size_area(breaks = c(1, 2, 3, 5)) +
theme(legend.position = "inside",
legend.position.inside = c(.9, .7),
legend.title = element_text(hjust = 0.5),
legend.spacing = unit(0.1, "cm"), # Reduce space between legend items
legend.margin = margin(1, 5, 1, 1), # Reduce margin around the legend
legend.key.size = unit(0.2, "cm"), # Reduce size of legend keys
legend.box.background = element_rect(fill = "white",
color = "black"))
library(patchwork)
(aer_plot_software + aer_plot_methods + aer_plot_repo) / (aer_plot_issues + aer_plot_issues_lang) + plot_annotation(tag_levels = 'A')
Summary of reproducibility reviews for JAS. A: Primary software used for the computational analysis reported in a manuscript. B: Computational or statistical method used by the authors (GMM = geometric morphometrics; Frequentist = hypothesis tests such as chi-square and ANOVA; AI/ML = artificial intelligence and machine learning, including neural networks and deep learning; MCMC = Markov Chain Monte Carlo, i.e. Bayesian models and other simulations; Network = statistical analysis of social networks; 3D = analysis of 3D data such as artefact models; Composition = compositional analysis of artefacts). C: Locations where authors deposited their code and data files. D: Issues that prevented the reproducible review from succeeding on the first attempt. E: Relationship between software used and issues that make research irreproducible.
At the time of writing (January 2025) we have completed 47 reproducibility reviews of 25 manuscripts submitted to JAS (most papers required multiple reviews). Of these, 11 have been published in JAS to date. Seven of these eleven papers fully passed the reproducibility review, resulting in a success rate, by one measure, of 64%. Four of the seven papers could be fully reproduced on my first attempt; the others required additional input from the authors. For comparison with the reproducibility studies in other fields reported above, the seven fully reproducible papers should be divided by the 25 reviewed for reproducibility, giving a 28% success rate. Expanding the denominator to include the total number of research articles published in JAS from May 2024 (when the first article to pass the reproducibility review, Herskind and Riede (2024), was published) to January 2025 (n = 97) offers another perspective. All 97 of these articles could have been eligible for reproducibility review had the authors used an open source programming language (e.g. instead of commercial software such as Microsoft Excel or SPSS). Under this broader scope, the success rate is 7%, a result also found in a study of 497 papers in 9 ecology journals (Kellner et al., 2025). By any measure, the computational reproducibility of archaeological research is at the low end of the distribution of values available from a variety of hard and soft sciences.
(fig-aer-summary?) shows a summary of basic characteristics of the 25 articles that have been through the reproducibility review process so far. The most commonly used software is R, followed by Python. Results generated with proprietary or closed-source software are out of scope for reproducibility reviews. Several distinct types of analyses are well represented in this sample, especially geometric morphometrics, network statistics, and analyses using artificial intelligence or machine learning algorithms (this includes deep learning and neural networks). Most authors are sharing their code and data files via Zenodo, a non-profit generic research data repository hosted by CERN that accepts any file format and freely assigns every publicly available upload a DOI to make the files easily and uniquely citable (Peters et al., 2017). In this same category of DOI-issuing, research-grade repositories are OSF (the Open Science Framework), Figshare, and university repositories. GitHub, a commercial code-hosting service owned by Microsoft that is convenient for collaboration, is also popular among JAS authors, but it is a problematic choice because it does not offer DOIs or the same commitments to long-term availability as Zenodo. Some authors attached their code and data as journal article supplementary files, but this is a poor choice for long-term availability because these files are typically renamed and converted to different formats during the article production process, making it difficult or impossible for a reader to combine the code and data to reproduce the results.
Panels D and E of (fig-aer-summary?) summarise the common issues that resulted in irreproducible results. The most common issue was an incomplete compendium, ranging from missing data files down to missing lines of code. In most cases this can be attributed to accidental carelessness, with the exception of two cases where data were unavailable due to licensing restrictions. Unspecified or under-specified dependencies are another common issue that prevents code from running. Dependencies are the software packages, in addition to R or Python themselves, that an author used to do specialized analyses and visualisations (e.g. dplyr for R or numpy for Python). If an author does not clearly specify the name and version number of the packages that they used for their analysis, it can be very time-consuming or impossible to correctly identify them, because many packages have functions with similar names, and functions in any one package can change the way they behave as the developers update their package. Other reasons why papers failed the reproducibility review are that the paths to data files were incorrectly specified (likely a result of the author reorganising their compendium after completing their analysis, or omitting data files from the materials submitted for review), and errors returned by functions, which have diverse causes.
Despite the relatively small number of reproducibility reviews reported on here, there are patterns of common issues that point to a small set of simple tasks authors can do that have high potential to increase reproducibility. The problem of incomplete materials can be tackled in several basic ways. First, authors should use a simple and logical folder structure to organise their code and data so that the compendium is as self-contained as possible. Authors should provide their materials organised such that a reader can successfully run all code as-is, without making any manual modifications (e.g. use relative rather than absolute file paths so that readers don’t have to rename or move files around to make the code work) (Sandve et al., 2013). Code and data files should be in the simplest format possible; for example, a plain text R script file is smaller and easier to use than a PDF or Word document that includes R code. Script files should make the order in which they are to be run explicit in the file name, e.g. 001-load-data.R, 002-clean-data.R, 003-analyse-data.R. There are many excellent, simple, and widely-used project templates that make it easy for authors to follow best practices of project organisation, e.g. Marwick et al. (2018), Figueiredo et al. (2022), Greenfeld and Community (2023), Cooper and Hsing (2017) and Wilson et al. (2017).
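As a minimal sketch of this kind of self-contained layout (all file and folder names here are hypothetical), numbered scripts plus relative paths built with here::here() let a reader run everything without editing the code:

```r
# Hypothetical compendium layout:
#   analysis/
#     data/raw_data/measurements.csv
#     scripts/001-load-data.R
#     scripts/002-clean-data.R
#     scripts/003-analyse-data.R
# In 001-load-data.R, build paths relative to the project root so the
# same code runs unmodified on any computer:
library(here)
measurements <- read.csv(here("analysis", "data", "raw_data", "measurements.csv"))
```

Because here() resolves paths from the project root rather than from wherever the user happens to launch R, the scripts stay portable when the compendium is downloaded and unzipped elsewhere.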
Second, authors should include in their compendium a README document that describes to readers the folders and files contained in the project (Abdill et al., 2024). The README file is typically the first file that a reader will look at in a compendium so it should include brief instructions to guide the user to a successful reproduction of the original results (e.g. what order to run the code files in). A README should also briefly describe the contents of the compendium, where other necessary files can be obtained (e.g. data files that cannot be included in the compendium due to ethical or other reasons), the key software packages needed and the version numbers that the authors used, and if the analysis takes more than a few minutes to run on a typical laptop, the hardware resources and compute time used by the author.
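A minimal README along these lines (all contents illustrative, not taken from any reviewed compendium) might look like:

```markdown
# Research compendium for "Title of the paper"

## Contents
- `analysis/data/`: raw data files used in the analysis
- `analysis/scripts/`: R scripts, numbered in the order they should be run

## How to reproduce the results
Run the scripts in `analysis/scripts/` in numeric order.
Key dependencies: R 4.3.2, dplyr 1.1.4, here 1.0.1.
The full analysis takes about 10 minutes on a typical laptop.
```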
Related to the basic documentation provided by the README, authors should document clear, direct and obvious connections between their code and the results they present in their paper (Sandve et al., 2013). One simple way to do this is to have one code file for each figure and table, and to name the code files with the figure or table number and some key words from the caption. Another way some authors are accomplishing this is by using literate programming tools, such as Quarto and Jupyter notebooks (Allaire and Dervieux, 2024; Kluyver et al., 2016), which enable the research narrative and code for data analysis to be woven together in one document. Quarto was a popular tool among the JAS papers in the reproducibility review sample; for example, Vernon and Ortman (2024) and Ragno (2024) wrote their entire manuscripts using Quarto.
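As a minimal sketch of this weaving in a Quarto (.qmd) document (object and label names hypothetical), the narrative text, the code that produces a figure, and the cross-reference to it all live in one file:

````markdown
We recorded measurements from `r nrow(flakes)` flakes (@fig-flake-mass).

```{r}
#| label: fig-flake-mass
#| fig-cap: "Distribution of flake mass (g)."
hist(flakes$mass_g)
```
````

When the document is rendered, the inline expression and the figure are regenerated directly from the data, so the numbers in the text cannot drift out of sync with the analysis.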
Documentation is also a key tool in tackling the problems with dependencies described in the previous section. Our finding that dependencies are a common cause of irreproducible results is consistent with previous studies that have identified this as a widespread weakness in communicating computationally intensive research (Samuel and Mietchen, 2024; Trisovic et al., 2022). In our sample, issues relating to dependencies are strongly associated with the use of Python. One possible reason for this is that relative to R, Python uses more package managers, more environments, and deeper dependency chains with more complex inter-dependencies that change more rapidly (Decan et al., 2019; Decan et al., 2016; Korkmaz et al., 2020). Another reason may be that there is a bigger and more established community of R users in archaeology (Batist and Roe, 2024; Schmidt and Marwick, 2020) that highly values code that is easy for others to reuse and has evolved practices to effectively communicate dependencies (e.g. Bilotti et al., 2024; Will and Rathmann, 2025).
The simplest way for archaeologists to improve here is to write the names and version numbers of the software and packages they used in their README file, as we see in Herskind and Riede (2024) and Monna et al. (2024). For more complex research projects, i.e. those using five or more packages or machine learning algorithms, authors should use dependency management tools to keep track of the packages and version numbers needed to reproduce their results. This is an active area of development, and while there are many tools currently available, the most robust and widely used include renv for R (Ushey and Wickham, 2025), used for example by Vernon and Ortman (2024) and Ragno (2024), and conda and poetry for Python (Anaconda, 2023; Crasta et al., 2023).
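In R, for example, recording and restoring a project's dependencies with renv takes only a few commands:

```r
# One-time setup in the project directory: creates a project-local
# package library and an renv.lock file recording exact versions
renv::init()

# After installing or updating packages, update the lockfile
renv::snapshot()

# A reader reproducing the analysis installs the same versions with
renv::restore()
```

Committing the resulting `renv.lock` file to the compendium gives readers a machine-readable record of every package version the analysis depends on.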
A more comprehensive solution, and the leading best practice for managing dependencies in many computationally intensive fields using Python in particular, is to include a Dockerfile in the compendium (Moreau et al., 2023). This is a set of machine- and human-readable instructions that enables a user to recreate the author’s computational environment (including requirements beyond the R or Python packages) on another computer (Nüst et al., 2020). Dockerfiles are gradually being adopted by archaeologists; see Crema et al. (2024) and Liao et al. (2024) for examples. Most of our reproducibility reviews recommend that authors include a Dockerfile to manage complex dependencies efficiently.
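A minimal Dockerfile for an R-based compendium might look like the following sketch (the base image tag, file paths, and script name are illustrative):

```dockerfile
# Pin a base image that fixes the R version and system libraries
FROM rocker/verse:4.3.2

# Copy the research compendium into the image
COPY . /home/rstudio/project
WORKDIR /home/rstudio/project

# Restore the exact package versions recorded in renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Run the full analysis when the container starts
CMD ["Rscript", "analysis/scripts/run-all.R"]
```

Because the Dockerfile captures the operating system, system libraries, R version, and package versions in one place, a reader can rebuild the entire computational environment with a single `docker build` command.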
Finally, for analyses that are not highly time-consuming (which was the case for over 90% of the sample), authors should re-run their code more than once, and ideally not on the same computer (e.g. by another co-author of the paper), before submission to confirm everything works as expected (Abdill et al., 2024; Roth et al., 2025). This ensures the project is self-contained and portable, and will help the authors detect and solve issues relating to path and function errors before they submit their work for review. Complex and time-consuming analyses should use pipeline or workflow management tools, e.g. GNU Make, Luigi, Snakemake, or targets, to document the relationships between the files and folders in a machine-readable format and simplify running and re-running code by others (Landau, 2021; Wratten et al., 2021).
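For instance, a small GNU Make rule set (all file and script names are invented) makes the dependencies between scripts and outputs explicit and lets anyone re-run the whole pipeline with a single `make` command:

```makefile
# Each rule states which inputs an output depends on, so `make`
# re-runs only the steps whose inputs have changed.
# Note: recipe lines must begin with a tab character.
all: analysis/figures/fig1.png

analysis/data/clean.csv: analysis/data/raw.csv analysis/scripts/01-clean.R
	Rscript analysis/scripts/01-clean.R

analysis/figures/fig1.png: analysis/data/clean.csv analysis/scripts/02-figure.R
	Rscript analysis/scripts/02-figure.R
```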
(fig-checklist?) summarises the key recommendations discussed in this section in a format that can be used as a checklist for authors submitting research for publication in JAS and other journals that do reproducibility reviews. This checklist is based on both the results presented in (fig-aer-summary?) and similar lists used by other journals, for example the Biometrical Journal (Hornung et al., n.d.) and The Review of Financial Studies (Pérignon et al., 2024).
[Figure: simple reproducibility checklist for authors.]
Although the introduction of reproducibility reviews signifies a growth in computational archaeology and a desire to evaluate the research products beyond the journal article, a very substantial amount of archaeological research is qualitative, with few or no numerical data involved in making knowledge claims about the human past. For example, many archaeological questions can be answered by the simple presence or absence of artefacts or features, or qualitative comparisons of basic artefact characteristics such as shape, colour, surface treatments, and raw material. Chaîne opératoire analyses by ceramic and lithic specialists are an especially productive area of archaeological research that often relies on comparison of narratives of manufacturing processes. While these studies unquestionably count as science, because they are a systematic, empirical, and rigorous process of inquiry, should they be held to standards of reproducibility in the same way that computationally intensive research is?
A similar debate has been unfolding more broadly about the humanities, where Peels and Bouter (2018a, 2018b) have argued that humanities disciplines that use empirical methods should be assessed by how well reanalysis of the original or new data using original or new methods produces the original or equivalent results. Resisting this proposal, Rijcke and Penders (2018) argue that humanities research is unique because it pursues value and meaning, and a given study can produce multiple valid answers relating to the value and meaning of a study object, so replication is irrelevant as a mark of quality. Peels (2018) disputes this uniqueness, claiming that the humanities have the same epistemic values as the sciences; however, some values have more weight in the humanities while others have more weight in the sciences, and unlike in the sciences, humanities scholars often study these epistemic values themselves. A consensus seems to be emerging that for some, but not all, studies in the humanities, replication is both possible and desirable, and that replication studies will differ from field to field and might even differ among various studies within a specific field (Bouter, 2019; Holbrook et al., 2019).
A key distinction here is that the locus of evaluation is empirical rather than computational, that is, concerned with appropriate reporting standards and documentation associated with physical evidence (e.g. artefacts and archives) (Stodden, 2015). A second important difference is that the debate about qualitative and humanities research is oriented towards replication (new data and/or new methods in an independent study to produce the same findings as the original publication) rather than reproducibility (same data and same methods, e.g. computer code, to produce the same results as the original publication) (cf. Barba, 2018). This orientation to replication emphasizes triangulation techniques that compare and integrate results coming from different traditions, locations, sources and methods, which in turn supports testing whether any given inference is robust in the face of different lines of evidence (Leonelli, 2018). In sum, there are many types of qualitative and humanistic archaeology where it is possible and meaningful to maximize the chances of non-computational replication, e.g. by carefully documenting data generating processes, to produce higher quality and more impactful results.
In the classic satirical novel Gulliver’s Travels (1726) by Irish writer Jonathan Swift, Gulliver visits the fictional Grand Academy of Lagado in Balnibarbi, a caricature of the Royal Society of London, and meets several researchers working on wildly impractical projects, including one attempting to extract sunbeams from cucumbers. This is usually interpreted as a subversive anti-colonial parody depicting institutionalized research as an absurd fund-raising activity with no practical benefits to society (Alff, 2014; Nicolson and Mohler, 1937). It has also been used as a metaphor for the difficulty of getting insights from data tables in scholarly publications (Feinberg and Wainer, 2011). This metaphor has additional relevance in our current age of computationally intensive research, where my experience as a reproducibility reviewer attempting to extract useful code and data from the publication of a computational study has sometimes felt as frustrating and fruitless as extracting sunbeams from a cucumber.
I have presented a bibliometric analysis of the status of archaeology as a science, showing distinct disunity that is increasing over time. On average we generally behave as social scientists, with some elements in common with harder sciences. These observations are consistent with Lakatos's (1978) model of a research program as a central foundation of irrefutable core assumptions complemented by a set of hypotheses, models, and methods that are adjusted, modified, or replaced by day-to-day research. Archaeology consists of multiple programs like this, as indicated by the spread of journals across PC2 of (fig-pca?), with distinct and sometimes non-overlapping sets of core assumptions. Some programs are more amenable to reproducibility, while others offer insights through qualitative and other methods. Among the programs that depend on quantitative methods to assess hypotheses and models, if they are to continue to progress towards increased consensus through the accumulation of reliable facts and methods, it is essential for researchers to take computational reproducibility seriously. Computers have become a central field and laboratory instrument for much of our work, so we have an ethical duty to document how we change our data as it flows through silicon just as carefully as we document the operating parameters of a mass spectrometer or any other field or laboratory instrument. However, the current state of quantitative archaeology, with most researchers not using open source code, is comparable to the secrecy of alchemy prior to the emergence of chemistry. Abandoning this habit of secrecy in favour of transparency and reproducibility is vital if we are to avoid a future where our journals are filled with pretty pictures depicting methods that the reader has no hope of repeating or adapting in their own work. Computational reproducibility must be considered a minimum requirement for evaluating the integrity and usefulness of quantitative results.
Computational reproducibility is not a panacea; it should not be used as a universally accepted criterion for research quality (Leonelli, 2018). Results that are fully reproducible can contain errors and fraud. It is no guarantee of code quality, or that statistics have been used appropriately (Crema, 2025; cf. Vaiglova, 2025), or that data management is consistent with FAIR and CARE principles (Carroll et al., 2021). It is also time-consuming for authors to ensure their computational work can be reproduced, and for reviewers to evaluate. This is especially the case for papers reporting results generated by long-running simulation or deep learning, which may be impracticable to fully reproduce for a typical peer reviewer who does not have access to specialised computing facilities. In a professional environment where job security and career progression are often associated with pressure to publish many high-impact papers, demands for authors to spend time on reproducibility, resulting in less time for publishing more papers, may seem frustrating (Edwards and Roy, 2017; Hagstrom, 1965). This may seem especially unfair to early career researchers on short-term contracts, who may feel the goalposts for career success are being moved and that they are being asked to do more work that their graduate training has not prepared them for. This highlights the need for a culture shift among senior archaeological scientists to value reproducibility in hiring and promotion decision-making. This is important for updating the alignment of quantitative archaeology with normative ideals of scientific practice, such as communal sharing and organized skepticism (Merton, 1973). Professors must contribute to this shift by nurturing a culture of reanalysis and reproducibility in their teaching, for example by using replication assignments and by training students in the most current best practices and tools for reproducible research, such as R and Python (Dogucu, 2025; Marwick et al., 2020).
A key challenge for the future is changing the dominant habitus (e.g. dispositions, skills, and ways of perceiving) of senior scholars in gatekeeping positions so that reproducibility work will be recognized and rewarded with the same level of symbolic capital afforded to novel high-impact, highly cited publications (Bourdieu, 1988).
Versions of this paper were presented at the Summer School on Reproducible Research In Landscape Archaeology at the Freie Universität Berlin and Christian-Albrechts-Universität zu Kiel (2017), the Big Data in Archaeology Conference at the McDonald Institute for Archaeological Research at the University of Cambridge (2019) and the Workshop on Exploring Data-Driven Solutions to Archaeological Problems at the Abu Dhabi Institute at New York University (2025). Thanks to the participants of those events for their feedback. Thanks to the JAS editors for the invitation to contribute to this special issue.
The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.14897252